Michigan Rail Data¶

In this notebook, I will be exploring railroad data from the Michigan Geographic Framework.

Motivation¶

I found this somewhat mysteriously named dataset, "Rails", on detroitdata.org, which led me to the Michigan Department of Transportation (MDOT) ArcGIS page, more clearly titled "State Owned Roads (v17a)". I thought it could be fun to explore.

About the data¶

I found some Metadata from the State of Michigan that helps. It states "No warranty, expressed or implied, is made and no liability is assumed by the State of Michigan as to the accuracy or usability of this data."

Features:

  1. There are rails managed by organizations in surrounding regions, like Canada, Ohio, and Indiana. Maybe I can explore this?
  2. Seems to be intended for ArcGIS visualization, so some columns will need to be filtered out.

Challenges:

  1. It doesn't really have many columns for numerical metrics, not even GPS locations. Many columns are related to GIS, and there's not much I can do with rail length, which appears to be meters.
  2. What does each record or unit of rail mean? Are companies partially responsible for sectioned meters of rail?
  3. Many of the rail companies also operate in other states and Canadian provinces. Are those rails included?
  4. Despite everything, I still liked that Rails reflected real world data with real challenges, so I still wanted to try exploring the data.

If there are updated versions of this dataset in the works, maybe it will have more details in the future.

Installation and Imports¶

I initially with a virtual environment setup for this assignment. I forget what Python installs by default and does not install, so I may have included redundant packages/libraries.

  • I later had some trouble with the env directory in Jupyter, and ended up simply installing the packages to my system. I don't think I have any version conflicts, so I'm not too worried about direct pip installs vs environment pip installs. Either method should work.

Virtual Environment¶

In [16]:
# Mac
!python3 -m venv env
In [ ]:
# Linux or WSL
!python -m venv env

I began with a virtual environment, but was having trouble with how Jupyter Notebook set the directory, so I went with a system install of the libraries.

In [ ]:
!source env/bin/activate
!env/bin/pip install matplotlib pandas numpy bs4 plotly scipy

Direct library install¶

In [ ]:
!pip install matplotlib pandas numpy bs4 plotly scipy
In [20]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import requests
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from bs4 import BeautifulSoup
from scipy.stats import zscore, norm

Preprocessing¶

  • I'll keep the columns I intend to explore. Ex: There's no need for the 'version' column, as all the records are version 17a.
  • 'fename' and 'fename2' are the company names and abbreviations, kind of like airport codes, so I'll change them to 'company' and 'company_id".
In [21]:
rails = pd.read_csv('Rail.csv', usecols=['length', 'fename', 'fename2', 'countyr']).rename(columns={'countyr': 'fips', 'fename': 'company', 'fename2': 'company_id'})
rails.head()
Out[21]:
length company_id company fips
0 940.591284 NS Norfolk Southern Railway 115
1 55.062364 AA Ann Arbor Railroad 115
2 309.898384 CSX CSX Transportation 115
3 352.588757 AA Ann Arbor Railroad 115
4 190.354344 CSX CSX Transportation 115

Adding Counties¶

The fips column has county codes that are Federal Information Processing Standards (FIPS) codes corresponding to counties in Michigan. I can map them to the county and city names with the BeautifulSoup and Requests libraries.

This next cell can really can be skipped, as I ultimately did not use the city column. It's not very useful anyways, as it's just the government office of the county a railroad happens to run through. Ex: a railroad operated by Canadian National Railway with FIPS 163 would map to Detroit, even if the railroad is in a different section of Wayne County that is not Detroit.

  • FIPS 163 maps to Wayne County
  • FIPS 161 maps to Washtenaw County

In the end, I did not end up using the cities, so hard-coding the counties was sufficient.

In [22]:
url = 'https://en.wikipedia.org/wiki/List_of_counties_in_Michigan'
response = requests.get(url).text
soup = BeautifulSoup(response, 'html.parser')
tables = soup.find_all('table', {'class': 'wikitable'})
cities = tables[0]
fips_to_city = {}

for row in cities.find_all('tr'):
    cells = row.find_all('td')
    if len(cells) > 0:
        fips_code = cells[0].get_text(strip=True)
        city_name = cells[1].get_text(strip=True)
        
        # Add the FIPS code and city name to the dictionary
        fips_to_city[fips_code] = city_name

I had some trouble getting the county names progammatically, so I hard-coded them as there were only seven counties (MADS 505 life technique).

In [42]:
# fips = rails['fips'].unique()
fips_to_county = {115: 'Monroe', 163: 'Wayne', 161: 'Washtenaw', 93 : 'Livingston', 125: 'Oakland', 99 : 'Macomb', 147: 'St. Clair'}
In [44]:
fips_to_city = {int(k): v for k, v in fips_to_city.items()}
rails['county'] = rails['fips'].map(fips_to_county)
rails['city'] = rails['fips'].map(fips_to_city)
rails.head()
Out[44]:
length company_id company fips county city
0 940.591284 NS Norfolk Southern Railway 115 Monroe Monroe
1 55.062364 AA Ann Arbor Railroad 115 Monroe Monroe
2 309.898384 CSX CSX Transportation 115 Monroe Monroe
3 352.588757 AA Ann Arbor Railroad 115 Monroe Monroe
4 190.354344 CSX CSX Transportation 115 Monroe Monroe

Additional Data Frames¶

I think some groupings the rails by city and company would be helpful later on. I might move/reference these variables if I use them later.

  • On coming back to this cell, I changed city to county, as it made more sense to know the rails in each county rather than the government office location; the government may not even be responsible for the rail, such as when a Canadian company operates in Washtenaw County, having "Ann Arbor" as the location wouldn't be very helpful.
In [25]:
print(rails['length'].describe()) # Basic statistics

# For mapping company acronyms
companies = rails[['company_id', 'company']].drop_duplicates().set_index('company_id').to_dict()['company']

# I want to group the companies and cities by the total rail length they have.
company_rails = rails.groupby('company_id')['length'].sum().reset_index()
county_rails = rails.groupby('county')['length'].sum().reset_index()

# I'll label the column and sort
company_rails = company_rails.sort_values('length', ascending=False)
county_rails = county_rails.sort_values('length', ascending=False)

# Grouping by counts of rail in each company
company_counts = rails.value_counts('company')
county_counts = rails.value_counts('county')
count    3868.000000
mean      340.509784
std       369.659950
min         4.863020
25%        77.068172
50%       210.931104
75%       470.784820
max      2627.409279
Name: length, dtype: float64

Initial Plots¶

For categorical data, I think a basic histogram and bar plot would be a good starting point. With under 4,000 rails, grouping could give some insights as well. Looking at a histogram and boxplot, the data is skewed right, and nearly half of the railroads (~1750 / 3800) seem to be under 150-200 meters. With a mean of 340 meters, likely from the skewness, this seems reasonable.

Railroads from Canada make up a lot more of the data than I initially thought, though it's quite reasonable for Michigan to have railroad cooperation with Canada. Perhaps plotting the counts could be a helpful addition.

Upon further investigation, two Canadian companies, CSX Transportation and Consolidated Rail, acquired a few US companies, so that could explain their large shares in rail length.

In [55]:
# Create subplots
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(16, 4))

# Create histogram
axs[0].hist(rails['length'], bins=15)
axs[0].set_xlabel('Meters')
axs[0].set_ylabel('Number of Rails')
axs[0].set_title('Counts of Rail Lengths')

# Create bar chart
companies_grouped = rails.groupby('company')['length'].sum() # Grouping the railroads by company for the bar chart
num_bar_ticks = np.arange(len(companies_grouped.index))

axs[1].bar(companies_grouped.index, companies_grouped.values)
axs[1].set_xticks(num_bar_ticks, labels=companies_grouped.index, rotation=45, ha='right')
axs[1].set_xlabel('Company', fontsize=10)
axs[1].set_ylabel('Meters')
axs[1].set_title('Rail Length per Company')
Out[55]:
Text(0.5, 1.0, 'Rail Length per Company')
No description has been provided for this image

Creating Plots for Meaningful Visualization¶

I have some ideas from the initial plots and information I would like to gain:

  • Plot the distribution of rail lengths to compare with the histogram.
  • Show the the number and length of rails in each county a company operates.
  • Show the number and length of rails in each county.

For a visualization library, I went with Plotly Express, as their website and tutorials seemed reasonable. Though it has a lot of abstraction, I guess that's why it's called "Express". Visualization isn't a strength of mine, so I thought it would be fine to offload things to Plotly, and its syntax is very similar to Pyplot.

The figures I create here are as follows:

  1. Figure 1: A histogram of the rail lengths like in my exploratory analysis.
  2. Figure 2: A line distribution of the rail lengths.
  3. Figure 3: A bar plot of the length of rail in each county.
  4. Figure 4: A bar plot of the number of rail units in each county.
  5. Figure 5: A bar plot of the length of rail operating by each company.
  6. Figure 6: A bar plot of the number of rail units per each company.
  7. Figure 7: A heat map of the companies operating in each county.

My axes lables and titles are overriden in the figure containing all subplots, but I kept them in case I reused them elsewhere.

In [36]:
# Histogram
fig1 = px.histogram(rails, x='length', nbins=15, title='Counts of Rail Lengths (in meters)')
fig1.update_xaxes(title_text='Length of Rail')
fig1.update_yaxes(title_text='Number of Rails')

mean_rail_length = rails['length'].mean()
std_rail_length = rails['length'].std()

# Create a range for x values that span the data
x_values = np.linspace(rails['length'].min(), rails['length'].max(), 100)
dist = norm.pdf(x_values, mean_rail_length, std_rail_length)
fig2 = go.Figure()
fig2.add_trace(go.Scatter(x=x_values, y=dist, mode='lines', name='Distribution of Lengths'))
fig2.update_xaxes(title_text='Meters of Rail')
fig2.update_yaxes(title_text='Number of Rails')

# Bar plots for county rails
fig3 = go.Figure(data=[go.Bar(x=county_rails['county'], y=county_rails['length'])])
fig3.update_layout(title='Total Rail Meters per County', xaxis_title='County', yaxis_title='Total Meters')

fig4 = go.Figure(data=[go.Bar(x=county_counts.index, y=county_counts.values)])
fig4.update_layout(title='Rail Units per County', xaxis_title='County', yaxis_title='Rail Counts')

# Bar plots for company rails
fig5 = go.Figure(data=[go.Bar(x=company_rails['company_id'], y=company_rails['length'])])
fig5.update_layout(title='Rail Meters per Company', xaxis_title='Company', yaxis_title='Total Meters')

fig6 = go.Figure(data=[go.Bar(x=company_rails['company_id'], y=company_counts.values)])
fig6.update_layout(title='Rail Units per Company', xaxis_title='Company', yaxis_title='Rail Counts')
fig6.update_xaxes(tickangle=-45)

# Heat Map Setup
# Group by county and company (cc) for total rail length
cc_df = rails.groupby(['county', 'company_id'])['length'].sum().reset_index()
cc_df_pivot = cc_df.pivot(index="county", columns="company_id", values="length")
cc_df_pivot = cc_df_pivot.fillna(0)

# Create the heat map
fig7 = px.imshow(
    cc_df_pivot.values,
    labels=dict(x='Company ID', y='County'),
    x=cc_df_pivot.columns,
    y=cc_df_pivot.index
)

plt.close()

Adding Plots to a Dashboard¶

I'll also add the company acronyms to not clutter everything.

In [40]:
# Adding abbreviations, as the company names are really long
abbreviations_text = '\n'.join([f"{k}: {v}" for k, v in companies.items()])
print(abbreviations_text)

titles = ('Rail Lengths',
          'Distribution of Lengths',
          'Rail Length per County', 
          'Rail Units per County',
          'Total Rail Length per Company',
          'Rail Units per Company',
          'Companies Total Railroad Meters by County')

subplot_labels = [
    ('Meters', 'Number of Rails'),
    ('Meters', 'Probability Density'),
    ('County', 'Meters'),
    ('County', 'Rail Counts'),
    ('Company', 'Meters'),
    ('Company', 'Rail Counts'),
    ('Company', 'County')
]

fig = make_subplots(rows=4, cols=2, subplot_titles=titles)

# Add each figure/chart to create a dashboard
fig.add_trace(fig1['data'][0], row=1, col=1)
fig.add_trace(fig2['data'][0], row=1, col=2)
fig.add_trace(fig3['data'][0], row=2, col=1)
fig.add_trace(fig4['data'][0], row=2, col=2)
fig.add_trace(fig5['data'][0], row=3, col=1)
fig.add_trace(fig6['data'][0], row=3, col=2)
fig.add_trace(fig7['data'][0], row=4, col=1)

# Hide the traces
for trace in fig['data']: 
    if(trace['name'] != 'B'): trace['showlegend'] = False

# I modified this code from Maizey to iterate through and add my subplot labels
for i, (x_title, y_title) in enumerate(subplot_labels, start=1):
    row = (i - 1) // 2 + 1
    col = (i - 1) % 2 + 1
    fig['layout'][f'xaxis{row * 2 - (2 - col)}'].title = x_title
    fig['layout'][f'yaxis{row * 2 - (2 - col)}'].title = y_title
    
fig.update_layout(
    height=1000, 
    width=1200, 
    title_text='Michigan Rail Information', coloraxis=dict(colorscale='Viridis'))

fig.show()
NS: Norfolk Southern Railway
AA: Ann Arbor Railroad
CSX: CSX Transportation
CN: Canadian National Railway
IO: Indiana & Ohio Railroad
CR: Consolidated Rail Corporation
DC: Delray Connecting Railroad
HE: Huron & Eastern Railway
CP: Canadian Pacific Railway
DCON: Detroit Connecting Railroad

Analysis¶

  • Hovering over Rail Lengths and Distribution of Lengths verifies my estimate in that 1,880 rails are under length of 200 meters. These units could be for some online visual tool not reflective of real world distance though.

  • The skewness is also much more pronounced in the distribution. I wonder if multiple companies are responsible for really small sections of a single railroad to cause this. Why are there rail records that are significantly longer? Is a single company responsible for that section?

  • When comparing the rail lengths and rail units for companies, the amount of rail a company owns doesn't have a huge difference in length across companies. I.e. there don't seem to be any companies who only build really long rails and companies who build a lot of really short rails.

  • For counties, Monroe the proportion of rail length to rail units seems higher than others. If were to explore further, I would want to dig into the ratios per county and company.

  • Canadian National Railway and CSX Transportation have a huge share of rails in Michigan.

  • The heat map legend is visually clunky, and I could not find a way to move the legend due to time constraints. It is nice to hover and quickly see the companies operating in each county. Interestingly, Ann Arbor Rail is in second place for rail length in Washtenaw County; Norfolk Southern still has more rail. Norfolk Southern operates as an "all or nothing" company. They either have 0 rail length in a county, or top the charts and contend with the two big Canadian companies (CN and CSX).

Final Notes¶

Thoughts on Plotly¶

When using Plotly Express on a few plots at a time, it really helped give me more ideas on what to explore next and even go back in my code to change how I edited the dataframe. I wasn't sure how to really document those steps where I loop back, but Plotly is a nice declarative way to quickly spark ideas on what to look for and scrap unnecessary ideas. I didn't have to get bogged down with iterations, except for when adding my final labels. I found myself having difficulty in fine-tuning features, but I think this would have happened with any library.

The zooming definitely helps for the companies with less rails, but I'm not sure about the panning and selection features Plotly gives. For a presentation, I don't know if I'd want charts to be manipulated unless exploring with a team.

I'm fine with not having fine-grained controls over the interactivity of plots, but more conscientiously programmed features are memory efficient, if that's a concern. I haven't explored many customization options, so it could be possible to change some presets that I'm unaware about.

Thoughts on Rails¶

This was interesting to work with, and I felt extremely limited with rail lengths. It took some iteration to realize what I could explore, and there are still more metrics I haven't tried. I'm glad the FIPS codes were there to look at counties, but I would like to know more about the other columns that seemed intended for ArcGIS.